fix: resolve memory leak and process crashing during concurrent PDF ingestion #586

Merged
param20h merged 3 commits into param20h:dev from knoxiboy:fix/issue-565-ingestion-concurrency on Jun 28, 2026

Conversation

@knoxiboy knoxiboy commented Jun 13, 2026

Contributor

🔗 Related Issue

Closes #565


📝 What does this PR do?

Fixes the process crashes and memory leaks seen during concurrent PDF ingestion:

  1. Adds concurrency throttling: a semaphore limits layout parsing and file extraction to at most 3 concurrent tasks.
  2. Ensures all PDF reader file handles and image-extraction buffers are closed and cleaned up, using try...finally blocks and context managers in AdvancedPDFParser and tasks.
  3. Calls gc.collect() explicitly after each page's layout analysis and extraction, so memory does not accumulate across pages.
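
For reviewers who want a picture of how the three changes fit together, here is a minimal sketch. It is not the code from this PR: the class `ThrottledParser`, its methods, and the use of `threading.BoundedSemaphore` with a thread pool are assumptions made for illustration (the real implementation may use asyncio or a task queue), and an `io.BytesIO` buffer stands in for a PDF reader handle. Only the limit of 3 and the `gc.collect()` call come from the description above.

```python
import gc
import io
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledParser:
    """Hypothetical stand-in for the throttled ingestion path (not the PR's code)."""

    def __init__(self, max_concurrent: int = 3):
        # At most `max_concurrent` parses may hold the gate at once.
        self._gate = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0      # highest number of simultaneous parses observed
        self.closed = 0    # number of buffers that were closed

    def parse(self, data: bytes) -> int:
        """Pretend to parse one PDF; returns its size in bytes."""
        with self._gate:  # blocks while the limit is already reached
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            buffer = io.BytesIO(data)  # stands in for a reader / image buffer
            try:
                time.sleep(0.01)  # simulate layout analysis work
                return len(buffer.getvalue())
            finally:
                buffer.close()  # runs even if parsing raises
                gc.collect()    # explicit collection after the work unit
                with self._lock:
                    self._active -= 1
                    self.closed += 1

    def ingest_all(self, documents):
        # The pool has more workers than the gate allows, so the
        # semaphore, not the pool size, is what limits concurrency.
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(self.parse, documents))


if __name__ == "__main__":
    parser = ThrottledParser(max_concurrent=3)
    sizes = parser.ingest_all([b"x" * n for n in range(1, 11)])
    print(sizes, "peak concurrency:", parser.peak)
```

Note that `gc.collect()` only reclaims objects that are no longer reachable (including reference cycles); it cannot free a buffer that is still referenced, which is why the explicit `close()` in `finally` matters independently of the collection call.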

🗂️ Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 🔧 Refactor / code cleanup
  • 📝 Documentation update
  • 🎨 UI / styling change
  • ⚙️ CI / tooling / config change
  • 🧪 Tests

🧪 How was this tested?

  • Ran the backend locally
  • Tested document uploads manually

📸 Screenshots (if UI change)


⚠️ Anything to flag for reviewers?

None.


✅ Self-Review Checklist

  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

@knoxiboy knoxiboy requested a review from param20h as a code owner June 13, 2026 12:22

Contributor Author

@param20h Please review and merge this PR.

@param20h param20h merged commit 3cc8a97 into param20h:dev Jun 28, 2026
@github-actions github-actions bot added the gssoc (GirlScript Summer of Code 2026 issue/PR), gssoc:approved (Approved for GSSoC base points, +50 pts), and mentor:param20h (Mentor for this PR) labels Jun 28, 2026

Labels

  • gssoc:approved: Approved for GSSoC base points (+50 pts)
  • gssoc: GirlScript Summer of Code 2026 issue/PR
  • level:intermediate: +35 pts
  • mentor:param20h: Mentor for this PR

Development

Successfully merging this pull request may close these issues.

[BUG] Resolve Memory Leak and Process Crashing during Concurrent PDF Ingestion

2 participants